Modeling

Preliminary thoughts:

Datathon vs Research vs Industry Project

Model for:       | low prediction error | explainable | robust | causal | maintainable | usability (prediction time, model size)
Datathon         | yes                  | -           | -      | -      | -            | -
Research         | yes                  | yes         | yes    | maybe  | -            | -
Industry Project | yes                  | yes         | yes    | maybe  | yes          | yes

Datathon:

  • Compare train and test data: are the distributions identical? Are there missing variables?
  • Just optimize prediction error.
  • Be aware: not necessarily the best model wins, but the model that predicts the values on the test set best (so it is also important to understand the test set).
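A minimal sketch of such a train/test comparison (it assumes the `trainval_data` and `test_data` tibbles that are read in further below; `LotArea` is just an illustrative choice of feature):

```r
# features present in train but not in test (typically just the target)
setdiff(names(trainval_data), names(test_data))

# compare the distribution of a single numeric feature between train and
# test with a two-sample Kolmogorov-Smirnov test
# (a small p-value suggests the distributions differ)
ks.test(trainval_data$LotArea, test_data$LotArea)

# side-by-side density plots are often more informative than a single test
dplyr::bind_rows(train = trainval_data, test = test_data, .id = "source") %>%
    ggplot2::ggplot(ggplot2::aes(x = LotArea, colour = source)) +
    ggplot2::geom_density() +
    ggplot2::scale_x_log10()
```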

Research:

  • (non-machine-learning research, applied modeling) Optimize prediction error, but also try to make the model explainable, robust and, depending on the problem, causal (prescriptive on top of predictive).

Industry Project:

  • Think about the data you will predict on: will it differ from the training data? Will it change over time? Optimize prediction error, but also try to make the model explainable, maintainable (including easily retrainable), robust and fast at prediction time, and, depending on the problem, causal (prescriptive on top of predictive).

We will need the following libraries:

(bold ones are necessary, the rest are optional)

  • tidyverse
  • tidymodels
  • glmnet
  • patchwork
  • DataExplorer
  • xgboost
  • vip
knitr::opts_chunk$set(echo = TRUE, cache = FALSE, message = FALSE, warning = FALSE, comment = "")

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tidymodels)
## Registered S3 method overwritten by 'tune':
##   method                   from   
##   required_pkgs.model_spec parsnip
## -- Attaching packages -------------------------------------- tidymodels 0.1.3 --
## v broom        0.7.9      v rsample      0.1.0 
## v dials        0.0.10     v tune         0.1.6 
## v infer        1.0.0      v workflows    0.2.3 
## v modeldata    0.1.1      v workflowsets 0.1.0 
## v parsnip      0.1.7      v yardstick    0.0.8 
## v recipes      0.1.16
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
## * Use tidymodels_prefer() to resolve common conflicts.
library(patchwork)
library(vip)
## 
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi
theme_set(theme_bw())

ncores <- 4

Data modeling lifecycle

  • Import - Reading in the data
  • Tidy - Data cleaning
  • Data transformation - Feature engineering and Feature selection
  • Modeling - Hyper parameter optimization, Model evaluation
  • Visualization
  • Communication

TidyModels

https://www.tidymodels.org/packages/

The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles.
Here is a selection of packages we will use:

https://rsample.tidymodels.org/ With the functions from the rsample package, we split our data into different folds for 5-fold cross-validation.
https://recipes.tidymodels.org/ We will use the recipes package to transform the data. It creates a data-transformation recipe that can be applied to various datasets.
https://parsnip.tidymodels.org/ We use the parsnip package to define our models.
https://workflows.tidymodels.org/ The workflows package creates a workflow for us; it defines the model and the data transformation we want to use.
https://tune.tidymodels.org/ The tune package can be used to tune the hyperparameters of the model (for example the number of trees, if we use a random forest, or the maximum depth of a tree).
https://dials.tidymodels.org/ ...and the dials package creates and manages tuning parameters and parameter grids.
https://yardstick.tidymodels.org/ The yardstick package provides us with functions to evaluate the performance of our model.

Import - reading the data

trainval_data <- read_csv("data/train.csv")
test_data <- read_csv("data/test.csv")
names(trainval_data)
FALSE  [1] "Id"            "MSSubClass"    "MSZoning"      "LotFrontage"  
FALSE  [5] "LotArea"       "Street"        "Alley"         "LotShape"     
FALSE  [9] "LandContour"   "Utilities"     "LotConfig"     "LandSlope"    
FALSE [13] "Neighborhood"  "Condition1"    "Condition2"    "BldgType"     
FALSE [17] "HouseStyle"    "OverallQual"   "OverallCond"   "YearBuilt"    
FALSE [21] "YearRemodAdd"  "RoofStyle"     "RoofMatl"      "Exterior1st"  
FALSE [25] "Exterior2nd"   "MasVnrType"    "MasVnrArea"    "ExterQual"    
FALSE [29] "ExterCond"     "Foundation"    "BsmtQual"      "BsmtCond"     
FALSE [33] "BsmtExposure"  "BsmtFinType1"  "BsmtFinSF1"    "BsmtFinType2" 
FALSE [37] "BsmtFinSF2"    "BsmtUnfSF"     "TotalBsmtSF"   "Heating"      
FALSE [41] "HeatingQC"     "CentralAir"    "Electrical"    "1stFlrSF"     
FALSE [45] "2ndFlrSF"      "LowQualFinSF"  "GrLivArea"     "BsmtFullBath" 
FALSE [49] "BsmtHalfBath"  "FullBath"      "HalfBath"      "BedroomAbvGr" 
FALSE [53] "KitchenAbvGr"  "KitchenQual"   "TotRmsAbvGrd"  "Functional"   
FALSE [57] "Fireplaces"    "FireplaceQu"   "GarageType"    "GarageYrBlt"  
FALSE [61] "GarageFinish"  "GarageCars"    "GarageArea"    "GarageQual"   
FALSE [65] "GarageCond"    "PavedDrive"    "WoodDeckSF"    "OpenPorchSF"  
FALSE [69] "EnclosedPorch" "3SsnPorch"     "ScreenPorch"   "PoolArea"     
FALSE [73] "PoolQC"        "Fence"         "MiscFeature"   "MiscVal"      
FALSE [77] "MoSold"        "YrSold"        "SaleType"      "SaleCondition"
FALSE [81] "SalePrice"

Tidy - data cleaning

Clean the data set

Quick overview of independent variables / features

p <- DataExplorer::plot_missing(trainval_data, missing_only = T)

cols_to_remove <- p$data %>% filter(Band %in% c("Remove", "Bad")) %>% pull(feature)
cols_to_remove
FALSE [1] Alley       FireplaceQu PoolQC      Fence       MiscFeature
FALSE 81 Levels: PoolQC MiscFeature Alley Fence FireplaceQu ... SalePrice
Fill in known NAs
data <- bind_rows(
    trainval = trainval_data,
    test = test_data,
    .id = "source"
)
data <- data %>% replace_na(
    list(
        Alley = "No Alley Access",
        BsmtQual = "No Basement",
        BsmtCond = 'No Basement',
        BsmtExposure = "No Basement",
        BsmtFinType1 = "No Basement",
        BsmtFinType2 = "No Basement",
        FireplaceQu = "No Fireplace",
        GarageType = "No Garage",
        GarageFinish = "No Garage",
        GarageQual = "No Garage",
        GarageCond = "No Garage",
        PoolQC = "No Pool",
        Fence = "No Fence",
        MiscFeature = "None"
    )
)

trainval_data <- data %>%
    filter(source == "trainval") %>%
    select(-source)
test_data <- data %>%
    filter(source == "test") %>%
    select(-source)
p <- DataExplorer::plot_missing(trainval_data, missing_only = T)

cols_to_remove <- p$data %>%
    filter(Band %in% c("Remove", "Bad")) %>%
    pull(feature)
cols_to_remove
FALSE factor(0)
FALSE 81 Levels: LotFrontage GarageYrBlt MasVnrType MasVnrArea Electrical ... SalePrice

Exploratory Data Analysis - Understanding the data + feature engineering

Feature engineering (and feature selection) are often the most important parts of the modeling process.

  • What do the features (variables) mean?
  • What other features could be useful and can be created from the features you have?
The target variable: Price
ggplot(trainval_data, aes(x=SalePrice)) +
    geom_histogram(aes(y=..density..), colour = "black", alpha = .2) +
    geom_density(fill="red", alpha=.2) +
    labs(x = "", y = "")

Here: prices are strictly positive (above zero). Should a house be twice as expensive if it is twice as large?

If the dependence on our variables is multiplicative (i.e. the price doubles when the size doubles; a bathroom increases the price of the house by a certain percentage, so the absolute increase is higher for an expensive house than for a cheap one), then we want to use log-linear regression or Poisson regression.

Example:
\(Y = a_1 X_1 + a_2 X_2 + a_3 X_3 + ... + a_n X_n\) - features are additive
\(log(Y) = b_1 log(X_1) + b_2 log(X_2) + b_3 log(X_3)+ ... + b_n log(X_n)\) - features are multiplicative!

By using a log transformation on your (numerical) features, you can change how they influence the model outcome.
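A tiny simulation (toy data, not the housing set) shows why this matters: when the data-generating process is multiplicative, the log-log model recovers it almost exactly.

```r
set.seed(1)
toy <- data.frame(
    size    = runif(200, 50, 300),
    quality = runif(200, 1, 10)
)
# multiplicative ground truth: price = size * sqrt(quality) * small noise
toy$price <- toy$size * sqrt(toy$quality) * exp(rnorm(200, sd = 0.05))

fit_raw <- lm(price ~ size + quality, data = toy)               # misspecified additive model
fit_log <- lm(log(price) ~ log(size) + log(quality), data = toy) # matches the true process

summary(fit_raw)$r.squared
summary(fit_log)$r.squared # close to 1
coef(fit_log)              # exponents recovered: ~1 for size, ~0.5 for quality
```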

New target variable: log(SalePrice)
The independent variables (features) will be transformed later.

trainval_data <- trainval_data %>% mutate(logSalePrice = log(SalePrice))

# hist
g1 <- ggplot(trainval_data, aes(x = logSalePrice)) +
    geom_histogram(aes(y = ..density..), colour = "black", fill = "white") +
    geom_density(alpha = .2, fill = "#FF6666") +
    labs(x = "", y = "")

# boxplot
g2 <- ggplot(trainval_data, aes(y = logSalePrice)) +
    geom_boxplot(aes(x = ""), colour = "black", fill = "white") +
    coord_flip() +
    labs(x = "", y = "")

# qqplot
g3 <- ggplot(trainval_data, aes(sample = logSalePrice)) +
    stat_qq() +
    stat_qq_line() +
    labs(x = "", y = "")

g3 | g1 / g2

Some ideas for creating new features

  • non-linear transformation: log-transform both target and features if the features act multiplicatively rather than additively; transform only the features if the target depends logarithmically on a feature.
  • indicator function: whether a feature is present or not, e.g. if we only care whether the house has a kitchen, not how many, or if the number of kitchens is not important.
  • interaction f(feature1, feature2): if two features interact, we can create a new feature from the interaction, for example the average living surface per room: total living surface / number of rooms.
  • binning: if we do not need very fine-grained information, we can bin the data, for example the age of the house in steps of 20 years.
  • decorrelation: if variables are highly correlated, we can decorrelate them, either with PCA (and variable reduction) or by creating variables that are less correlated.
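Applied to this dataset, a few of these ideas could look like the following (a sketch only; `HasPool`, `SurfacePerRoom` and `AgeBin` are hypothetical feature names, not used elsewhere in this document):

```r
trainval_data %>%
    mutate(
        # indicator: does the house have a pool at all?
        HasPool = as.integer(PoolArea > 0),
        # interaction: average living surface per room
        SurfacePerRoom = GrLivArea / TotRmsAbvGrd,
        # binning: age of the house at the time of sale, in 20-year steps
        AgeBin = cut(YrSold - YearBuilt,
                     breaks = seq(0, 200, by = 20),
                     include.lowest = TRUE)
    ) %>%
    select(SalePrice, HasPool, SurfacePerRoom, AgeBin)
```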

Correlations

trainval_data %>%
    select(where(is.numeric)) %>%
    cor( method="kendall", use="pairwise.complete.obs") %>%
    corrplot::corrplot()

Outliers

Especially if we use models that are not robust, like linear regression, outliers can severely skew our results. Sometimes it is good to remove outliers from the data before training the model.

 g1 / g2

# identify points in the extreme 5% of the observed data
trainval_data %>% filter(
    logSalePrice > quantile(logSalePrice, .975) | # OR
        logSalePrice < quantile(logSalePrice, .025)) %>%
    select(SalePrice, Alley, LotArea, PoolArea, MiscFeature)
FALSE # A tibble: 72 x 5
FALSE    SalePrice Alley           LotArea PoolArea MiscFeature
FALSE        <dbl> <chr>             <dbl>    <dbl> <chr>      
FALSE  1     68500 No Alley Access    6324        0 None       
FALSE  2     40000 Pave               8500        0 None       
FALSE  3    385000 No Alley Access   50271        0 None       
FALSE  4    438780 No Alley Access   13682        0 None       
FALSE  5     79000 No Alley Access    9600        0 None       
FALSE  6    412500 No Alley Access   13688        0 None       
FALSE  7    501837 No Alley Access   17423        0 None       
FALSE  8    475000 No Alley Access   22950        0 None       
FALSE  9    386250 No Alley Access   13472        0 None       
FALSE 10    403000 No Alley Access   15138        0 None       
FALSE # ... with 62 more rows
# to remove these, invert the process;
# trainval_data <- trainval_data %>% filter(
#    logSalePrice < quantile(logSalePrice, .975) & # AND
#        logSalePrice > quantile(logSalePrice, .025))

you may also want to look at outliers in the features.

Modeling pipeline

Splitting the data into training and validation sets and using cross-validation.
https://rsample.tidymodels.org/ With the functions from the rsample package, we split our data into different folds for 5-fold cross-validation. We can split it in such a way that the distribution of our target variable is identical in all folds.

Train- Validation - Test split

In order to be able to evaluate your model quality, you need a training and a validation set. You train the model on the training set and evaluate it on the validation set. We split our training data so that 80% of it becomes the training set and 20% the validation set.

set.seed(24)
data_split <- initial_split(trainval_data, prop = 4/5, strata = 'logSalePrice', breaks = 10)
train_data <- training(data_split)
val_data <- testing(data_split)
cat('train+validation data size is:',dim(trainval_data),'\n')
FALSE train+validation data size is: 1460 82
cat('train data size is:',dim(train_data),'\n')
FALSE train data size is: 1165 82
cat('validation data size is:',dim(val_data),'\n')
FALSE validation data size is: 295 82
cat('test data size is:',dim(test_data))
FALSE test data size is: 1459 81

data preparation for cross validation

set.seed(42)
SalePrice_vfold <- vfold_cv(train_data, v = 5, strata = SalePrice)

Let’s start with recipes..

https://recipes.tidymodels.org/ We will use the recipes package to transform the data. It creates a data-transformation recipe that can be applied to various datasets. A recipe is a set of instructions for how to change a dataset. It can be applied to the training data before modeling, or to the test data before prediction.

  • recipe: pipeable sequences of feature-engineering steps to get your data ready for modeling
  • update_role: update the role of the independent variables (predictor or identifier; identifiers are not used for prediction)
  • step_rm: removes columns from the dataframe
  • step_unknown: assigns missing categories to "unknown" (or a name of your choice)
  • step_log: logarithmic transformation
  • step_normalize: all numeric variables are normalized
  • step_other: factors are grouped together into the "other" class if they are less frequent than the threshold
  • step_novel: if previously unseen factor levels are encountered, they are assigned to the "new" level
  • step_impute_knn: missing values are imputed with the k-nearest-neighbors algorithm
  • step_dummy: categorical variables are one-hot encoded
  • step_nzv: variables with near-zero variance are removed

There are many more options, get all information here: https://recipes.tidymodels.org/reference/index.html

SalePrice_recipe <- recipe(
    # set the target(dependent) and the feature (independent) variables
    train_data, logSalePrice ~ .) %>%

    #update the role of the independent variables (predictor or identifier)
    update_role(Id, new_role = "ID") %>%

    # remove columns we don't need
    step_rm(Id, SalePrice) %>%
    step_rm(any_of(cols_to_remove)) %>%

    # # giving missing values a label
    # step_unknown(c('BsmtQual','BsmtCond'), new_level='No Basement') %>%

    # apply a log transformation to all numeric predictors
    step_log(all_numeric(),-all_outcomes(), offset = 1) %>%

    # normalize all numeric predictors
    step_normalize(all_numeric(),-all_outcomes()) %>%

    # group together factor levels that occur less than the threshold
    step_other(all_nominal(), -all_outcomes(), threshold = 0.01) %>%

    # if previously unseen factor levels are encountered, they are given the "new" level
    step_novel(all_predictors(), -all_numeric()) %>%

    # impute missing variables with k nearest neighbours
    step_impute_knn(all_predictors()) %>%

    # remove predictors with near-zero variance
    step_nzv(all_predictors(), freq_cut=995/5) %>%

    # add dummy variables for all nominal predictors
    step_dummy(all_nominal(), -all_outcomes())
Think about:

For which columns does knn imputation make sense? How should the other columns be imputed?
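One possible mixed strategy, as a sketch (the column choices are illustrative assumptions; the recipe above simply uses `step_impute_knn(all_predictors())`):

```r
recipe(train_data, logSalePrice ~ .) %>%
    # numeric feature that varies smoothly between similar houses: knn is plausible
    step_impute_knn(LotFrontage, neighbors = 5) %>%
    # numeric feature where a robust central value is good enough
    step_impute_median(MasVnrArea) %>%
    # categorical feature: impute with the most frequent level
    step_impute_mode(Electrical)
```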

The recipe can be ‘trained’ on one dataset …

with the prep() method

SalePrice_recipe2 <- prep(SalePrice_recipe, training=train_data)

tidy(SalePrice_recipe)
FALSE # A tibble: 9 x 6
FALSE   number operation type       trained skip  id              
FALSE    <int> <chr>     <chr>      <lgl>   <lgl> <chr>           
FALSE 1      1 step      rm         FALSE   FALSE rm_XMBSP        
FALSE 2      2 step      rm         FALSE   FALSE rm_dex5l        
FALSE 3      3 step      log        FALSE   FALSE log_9KhLu       
FALSE 4      4 step      normalize  FALSE   FALSE normalize_eBkTV 
FALSE 5      5 step      other      FALSE   FALSE other_QKTAI     
FALSE 6      6 step      novel      FALSE   FALSE novel_EXG5i     
FALSE 7      7 step      impute_knn FALSE   FALSE impute_knn_PpuSA
FALSE 8      8 step      nzv        FALSE   FALSE nzv_w2ycB       
FALSE 9      9 step      dummy      FALSE   FALSE dummy_EUsSZ
tidy(SalePrice_recipe2)
FALSE # A tibble: 9 x 6
FALSE   number operation type       trained skip  id              
FALSE    <int> <chr>     <chr>      <lgl>   <lgl> <chr>           
FALSE 1      1 step      rm         TRUE    FALSE rm_XMBSP        
FALSE 2      2 step      rm         TRUE    FALSE rm_dex5l        
FALSE 3      3 step      log        TRUE    FALSE log_9KhLu       
FALSE 4      4 step      normalize  TRUE    FALSE normalize_eBkTV 
FALSE 5      5 step      other      TRUE    FALSE other_QKTAI     
FALSE 6      6 step      novel      TRUE    FALSE novel_EXG5i     
FALSE 7      7 step      impute_knn TRUE    FALSE impute_knn_PpuSA
FALSE 8      8 step      nzv        TRUE    FALSE nzv_w2ycB       
FALSE 9      9 step      dummy      TRUE    FALSE dummy_EUsSZ

…and applied to another dataset

with the bake() method

bake(SalePrice_recipe2, train_data)
FALSE # A tibble: 1,165 x 238
FALSE    MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd
FALSE         <dbl>       <dbl>   <dbl>       <dbl>       <dbl>     <dbl>        <dbl>
FALSE  1      0.990      0.244    0.338      -1.61       -0.461   -0.133        -0.876
FALSE  2     -0.568     -0.267   -0.666      -1.61        0.437   -1.46         -1.71 
FALSE  3      0.631     -0.775   -0.110      -1.61       -1.52    -1.70         -1.71 
FALSE  4      0.990     -0.0427  -0.752      -1.61       -0.461   -0.528        -1.47 
FALSE  5      0.729     -0.267   -0.422      -0.725       1.22    -1.70          0.532
FALSE  6     -0.568     -0.948   -1.26       -1.61        0.437   -0.860        -1.71 
FALSE  7      1.99      -3.16    -3.25       -1.61       -0.461    0.0641       -0.583
FALSE  8      0.152      1.30    -0.117      -2.70       -4.50    -1.86         -0.145
FALSE  9     -1.13       0.710   -0.110      -0.725      -2.82    -0.330        -1.17 
FALSE 10     -1.13       0.282    0.360      -1.61       -0.461   -0.199        -0.974
FALSE # ... with 1,155 more rows, and 231 more variables: MasVnrArea <dbl>,
FALSE #   BsmtFinSF1 <dbl>, BsmtUnfSF <dbl>, TotalBsmtSF <dbl>, 1stFlrSF <dbl>,
FALSE #   2ndFlrSF <dbl>, GrLivArea <dbl>, BsmtFullBath <dbl>, BsmtHalfBath <dbl>,
FALSE #   FullBath <dbl>, HalfBath <dbl>, BedroomAbvGr <dbl>, KitchenAbvGr <dbl>,
FALSE #   TotRmsAbvGrd <dbl>, Fireplaces <dbl>, GarageYrBlt <dbl>, GarageCars <dbl>,
FALSE #   GarageArea <dbl>, WoodDeckSF <dbl>, OpenPorchSF <dbl>, EnclosedPorch <dbl>,
FALSE #   MoSold <dbl>, YrSold <dbl>, logSalePrice <dbl>, MSZoning_RH <dbl>, ...

Modeling

https://parsnip.tidymodels.org/ We use the parsnip package to define our models

Here we will use Lasso, but there are other models you can use as well. Have a look here: https://parsnip.tidymodels.org/reference/index.html

Lasso Model

regularized linear regression

get information on linear regression models here: https://parsnip.tidymodels.org/reference/linear_reg.html
and for this one in particular here: https://parsnip.tidymodels.org/reference/details_linear_reg_glmnet.html

Lasso_model <- parsnip::linear_reg(
        mixture = 1,          #here we specify that we want a Lasso model, with L1 regularization.
        penalty = tune()) %>% #the penalty will be specified by tuning the model
    set_engine("glmnet") %>%
    translate()

Modeling Workflow

we add a model and a data transformation recipe to our workflow.
Then, we can fit the model together with the transformed data using the workflow.

https://workflows.tidymodels.org/ The workflows package creates a workflow for us; it defines the model and the data transformation we want to use.
Workflow_SalePrice_lasso <-
  workflow() %>%
  add_model(Lasso_model) %>%
  add_recipe(SalePrice_recipe)

Hyperparameter tuning

https://tune.tidymodels.org/ The tune package can be used to tune the hyperparameters of the model (for example the number of trees, if we use a random forest, the learning rate, or the maximum depth of a tree), and the dials package (https://dials.tidymodels.org/) creates and manages tuning parameters and parameter grids.
  • select a set of hyperparameters to explore. We use the dials package, which provides us with default values that we can also adjust.
    Have a look here: https://dials.tidymodels.org/reference

  • tune the hyperparameters, for example with grid search or Bayesian optimization.
    These algorithms use 5-fold cross-validation to choose the hyperparameters for which the model performs best (across all folds).

  • look at the optimal hyperparameter values. If they happen to be at the border of our parameter range, it might be good to extend the range and repeat the hyperparameter tuning.

Bayesian Optimization

(https://en.wikipedia.org/wiki/Bayesian_optimization).
At each iteration step, the most promising set of parameters for improving the model is chosen.
It uses a mixture of:

  • exploitation (knowing which parameters worked well in the past) and
  • exploration (choosing areas in the parameter space that have not been explored yet).
set.seed(42)

lasso_params <- parameters(penalty())
lasso_tune_bayes <- Workflow_SalePrice_lasso %>%
    tune_bayes(
        resamples = SalePrice_vfold,
        param_info = lasso_params,
        # initial = ?,
        iter = 30,
        metrics = metric_set(rmse),
        control = control_bayes(
            no_improve = 5,
            save_pred = T,
            verbose = T
        )
    )
# doParallel::stopImplicitCluster()
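The following chunks also reference a `lasso_tune_grid` object whose defining chunk is not shown above. A grid-search sketch that would produce it (the penalty range of 10^-4 to 10^-1 is an assumption, chosen to match the 30 printed penalty values):

```r
# grid-search counterpart of the Bayesian tuning above
lasso_grid <- grid_regular(penalty(range = c(-4, -1)), levels = 30)

lasso_tune_grid <- Workflow_SalePrice_lasso %>%
    tune_grid(
        resamples = SalePrice_vfold,
        grid = lasso_grid,
        metrics = metric_set(rmse),
        control = control_grid(save_pred = TRUE)
    )
```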
lasso_tune_grid %>% collect_metrics()
FALSE # A tibble: 30 x 7
FALSE     penalty .metric .estimator  mean     n std_err .config              
FALSE       <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
FALSE  1 0.0001   rmse    standard   0.138     5 0.00818 Preprocessor1_Model01
FALSE  2 0.000127 rmse    standard   0.138     5 0.00818 Preprocessor1_Model02
FALSE  3 0.000161 rmse    standard   0.138     5 0.00815 Preprocessor1_Model03
FALSE  4 0.000204 rmse    standard   0.137     5 0.00812 Preprocessor1_Model04
FALSE  5 0.000259 rmse    standard   0.137     5 0.00810 Preprocessor1_Model05
FALSE  6 0.000329 rmse    standard   0.136     5 0.00809 Preprocessor1_Model06
FALSE  7 0.000418 rmse    standard   0.136     5 0.00813 Preprocessor1_Model07
FALSE  8 0.000530 rmse    standard   0.135     5 0.00817 Preprocessor1_Model08
FALSE  9 0.000672 rmse    standard   0.134     5 0.00819 Preprocessor1_Model09
FALSE 10 0.000853 rmse    standard   0.134     5 0.00823 Preprocessor1_Model10
FALSE # ... with 20 more rows

visualizing the results and finding the best hyperparameter

lasso_tune_bayes %>%
    collect_metrics() %>%
    select(-.iter) %>%
    add_column(tune = 'bayes') %>%
    rbind(lasso_tune_grid %>%
              collect_metrics() %>%
              add_column(tune = 'grid')) %>%
    ggplot(aes(x = penalty, y = mean, color = .metric)) +
    geom_point(aes(shape = tune, size = tune)) +
    geom_line() +
    ylab("RMSE") +
    scale_x_log10(labels = scales::label_number()) +
    scale_size_manual(values = c(4, 2)) +
    ggtitle("dependency of the model's prediction error (RMSE) on the penalty hyper-parameter")

lasso_tune_grid %>% show_best("rmse", n = 1)
FALSE # A tibble: 1 x 7
FALSE   penalty .metric .estimator  mean     n std_err .config              
FALSE     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
FALSE 1 0.00356 rmse    standard   0.132     5 0.00788 Preprocessor1_Model16
lasso_tune_bayes %>% show_best("rmse", n = 1)
FALSE # A tibble: 1 x 8
FALSE   penalty .metric .estimator  mean     n std_err .config .iter
FALSE     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>   <int>
FALSE 1 0.00216 rmse    standard   0.132     5 0.00831 Iter3       3
updating the workflow with the best performing model
#choose the hyperparameters that worked best
lasso_best_params <- select_best(lasso_tune_grid, "rmse")

#update the hyper parameters of the model
SalePrice_final_model <- finalize_model(Lasso_model, lasso_best_params)

# update the workflow with the updated model
Workflow_SalePrice_lasso <- Workflow_SalePrice_lasso %>% update_model(SalePrice_final_model)

Model evaluation

https://yardstick.tidymodels.org/ The yardstick package provides us with functions to evaluate the performance of our model

average Model Error across all folds

(this is the error that was used to find the best hyperparameters)

set.seed(42)
lasso_fit_rs <-
    Workflow_SalePrice_lasso %>%
    fit_resamples(SalePrice_vfold)
collect_metrics(lasso_fit_rs)
FALSE # A tibble: 2 x 6
FALSE   .metric .estimator  mean     n std_err .config             
FALSE   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
FALSE 1 rmse    standard   0.132     5 0.00788 Preprocessor1_Model1
FALSE 2 rsq     standard   0.893     5 0.0132  Preprocessor1_Model1

Model error on the validation data set

final_res_lasso <- last_fit(Workflow_SalePrice_lasso, split = data_split)
final_res_lasso$.metrics[[1]]
FALSE # A tibble: 2 x 4
FALSE   .metric .estimator .estimate .config             
FALSE   <chr>   <chr>          <dbl> <chr>               
FALSE 1 rmse    standard       0.116 Preprocessor1_Model1
FALSE 2 rsq     standard       0.914 Preprocessor1_Model1

Predicting values on the validation set and visualizing the prediction error

SalePrice_lasso_fit  <- fit(Workflow_SalePrice_lasso, data = train_data)

pred_lasso <-
  predict(SalePrice_lasso_fit, val_data) %>%
  mutate(model = "regression",
         .pred = exp(.pred)) %>%
  bind_cols(val_data %>% select(SalePrice))

g3 <-
  pred_lasso %>%
  ggplot(aes(y = .pred, x = SalePrice))+
  geom_point()+
  geom_abline(intercept = 0, col = "red")+
  ggtitle('Lasso')


g4 <-
  pred_lasso %>%
  select(.pred, SalePrice) %>%
  gather(key, value) %>%
  ggplot(aes(x = value, colour = key, fill = key)) +
  geom_density(alpha=.2)+
  labs(x = "", y = "")

g3/g4

Feature importance

tidy( SalePrice_lasso_fit ) %>%
    arrange(desc(estimate)) %>%
    head(n=10)
FALSE # A tibble: 10 x 3
FALSE    term                 estimate penalty
FALSE    <chr>                   <dbl>   <dbl>
FALSE  1 (Intercept)           11.9    0.00356
FALSE  2 Neighborhood_StoneBr   0.139  0.00356
FALSE  3 GrLivArea              0.129  0.00356
FALSE  4 Neighborhood_NridgHt   0.120  0.00356
FALSE  5 Neighborhood_Crawfor   0.101  0.00356
FALSE  6 Neighborhood_NoRidge   0.0893 0.00356
FALSE  7 SaleType_New           0.0891 0.00356
FALSE  8 OverallQual            0.0687 0.00356
FALSE  9 Exterior1st_BrkFace    0.0685 0.00356
FALSE 10 Neighborhood_Somerst   0.0648 0.00356

Prediction on the test set and writing the submission file

final_fit <- fit(Workflow_SalePrice_lasso, data = trainval_data)

pred_lasso <-
  predict(final_fit, test_data) %>%
  transmute(SalePrice = exp(.pred)) %>% # transform the prediction back, as we were working with logarithmic prices
  bind_cols(test_data %>% select(Id)) %>%
  write_csv("submission.csv")

… and again!

with XGBoost

Model definition

Xgb_model <-
  parsnip::boost_tree(
        trees = tune(),      # the hyperparameters will be specified later, when we do hyperparameter tuning
        learn_rate = tune(), # but you can also set some fixed here.
        tree_depth = tune(),
        min_n = tune(),
        loss_reduction = tune(),
        sample_size = tune(),
        mtry = tune(),
  ) %>%
  set_mode("regression") %>%
  set_engine("xgboost", nthread = ncores)

#This function only defines what type of model is being fit.
#Once an engine is specified, the method to fit the model is also defined.

Workflow setup

Workflow_SalePrice_xgb <- workflow() %>%
    add_model(Xgb_model) %>%
    add_recipe(SalePrice_recipe) # re-using the same data pre-processing recipe

Hyperparameter tuning

To get a list of all hyperparameters and what they do, look here: https://parsnip.tidymodels.org/reference/boost_tree.html or here: https://parsnip.tidymodels.org/reference/details_boost_tree_xgboost.html

xgboost_params <- parameters(
    trees(),
    learn_rate(),
    tree_depth(),
    min_n(),
    loss_reduction(),
    sample_size = sample_prop(),
    finalize(mtry(), train_data)  # the parameter range of mtry (how many features are chosen at
)                                 # random for each tree) depends on the data set (the number of
                                  # features). With "finalize", we let the function determine the
                                  # upper bound given the data set.

xgboost_params <- xgboost_params %>%
    update(trees = trees(c(100, 500))) #parameter tuning ranges can also be updated later on.

-> We have many hyperparameters we want to optimize; Bayesian optimization is a better choice here.

When we tune the hyperparameters, we use 5-fold cross validation. The model is trained on 4 of the 5 folds and evaluated on the 5th fold. The algorithm then chooses the hyperparameters for which the model performs best across all folds. What we need:

  • the workflow containing the data transformation, information on target and feature variables, and the model (Workflow_SalePrice_xgb)
  • the cross-validation splitting of the data (SalePrice_vfold)
  • the parameter ranges for the model (xgboost_params)
  • for how many iterations we want to select different parameter sets
  • the metrics that we want to use to evaluate the model
  • the method for hyperparameter optimization
set.seed(42)

# this may take a while...
# feel free to set the `initial`, `iter`, and `no_improve` parameters lower
# for a faster (but less thorough) search.
xgboost_tune <- Workflow_SalePrice_xgb %>%
    tune_bayes(
        resamples = SalePrice_vfold,
        param_info = xgboost_params,
        initial = 10,
        iter = 30,
        metrics = metric_set(rmse, mape),
        control = control_bayes(
            no_improve = 5,
            save_pred = T,
            verbose = T
        )
    )

# doParallel::stopImplicitCluster()
autoplot(xgboost_tune)

xgb_best_params <- select_best(xgboost_tune, "mape")
SalePrice_final_model <- finalize_model(Xgb_model, xgb_best_params)
Workflow_SalePrice_xgb <- Workflow_SalePrice_xgb %>% update_model(SalePrice_final_model)

Model evaluation

final_res_xgb <- last_fit(Workflow_SalePrice_xgb, split = data_split)
final_res_xgb$.metrics[[1]]
FALSE # A tibble: 2 x 4
FALSE   .metric .estimator .estimate .config             
FALSE   <chr>   <chr>          <dbl> <chr>               
FALSE 1 rmse    standard       0.120 Preprocessor1_Model1
FALSE 2 rsq     standard       0.908 Preprocessor1_Model1
SalePrice_xgb_fit <- fit(Workflow_SalePrice_xgb, data = train_data)

pred_xgb <-
    predict(SalePrice_xgb_fit, val_data) %>%
    mutate(model = "XGBoost",
           .pred = exp(.pred)) %>%
    bind_cols(val_data %>% select(SalePrice))

g1 <-
    pred_xgb %>%
    ggplot(aes(y = .pred, x = SalePrice)) +
    geom_point() +
    geom_abline(intercept = 0, col = "red") +
    ggtitle('XGBoost')


g2 <-
    pred_xgb %>%
    select(.pred, SalePrice) %>%
    gather(key, value) %>%
    ggplot(aes(x = value, colour = key, fill = key)) +
    geom_density(alpha = .2) +
    labs(x = "", y = "")

g1 / g2

Feature importance

SalePrice_xgb_fit %>%
    extract_fit_parsnip() %>%
    vip()